Predicting Daily STOCK 1 Values at Market Close


AUTHOR: Vassil Dimitrov, PhD
START DATE: 2023-07-30

Contacts:
LinkedIn: linkedin.com/in/vassilDim
email: vasssildimitrov@gmail.com

Several models will be built on the historical share values of Stock 1 in order to predict future values on a daily basis. More specifically, 3 models of increasing complexity will be constructed, optimized and tested:

  1. Time series with SARIMAX
  2. Time series with TBATS, which can capture more than one seasonal pattern
  3. LSTM RNN deep networks

OUTLINE

  1. Module 1 - LOAD & CLEAN DATA

    • read in the data
    • explore data
    • deal with missing and duplicated values
  2. Module 2 - EXPLORATORY DATA ANALYSIS

    • visualize the data
    • subset the data
    • comment on any trends
  3. Module 3 - PREDICTION WITH SARIMAX
    • optimize hyperparameters
    • fit optimal model
    • assess performance daily
  4. Module 4 - PREDICTION WITH TBATS
    • optimize hyperparameters
    • fit optimal model
    • assess performance daily
  5. Module 5 - PREDICTION WITH LSTM RNN DEEP NETS
    • optimize hyperparameters and network architecture
    • fit optimal model
    • assess performance daily

In [1]:
 

MODULE 1 -- Load & Clean Data

In [2]:
 
         date      open      high       low     close  volume
0  1997-02-27  288.0002  288.0002  282.0002  285.1202     194
1  1997-02-28  282.0002  285.1202  282.0002  285.1202      25
2  1997-03-03  288.0002  288.0002  279.1202  279.1202     142
3  1997-03-04  278.8802  278.8802  278.8802  278.8802       0
4  1997-03-05  282.0002  285.1202  282.0002  285.1202     136

The data tracks six variables: the date of the record (date), the value at opening (open), the value at closing (close), the highest daily value (high), the lowest daily value (low) and the daily volume of transactions (volume). Intuitively, open and close should be strongly correlated, which will be shown in Module 2 - EDA.

Several steps will be taken to make sure that the data is ready for modelling:

  1. Appropriate column names
  2. Appropriate data types for each feature/column
    • numeric
    • datetime
  3. No missing values
    • detect missing values and impute
  4. No duplications
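
The first, second and fourth steps above can be sketched as follows. This is a minimal illustration, not the notebook's own cleaning code: the `raw` frame, its messy column names and the `clean` helper are hypothetical, and it assumes pandas is available.

```python
import pandas as pd

# Hypothetical raw frame: untidy column names, string-typed values, one duplicated row.
raw = pd.DataFrame(
    {
        " Date ": ["1997-02-27", "1997-02-28", "1997-02-28", "1997-03-03"],
        "Close": ["285.1202", "285.1202", "285.1202", "279.1202"],
        "Volume": ["194", "25", "25", "142"],
    }
)


def clean(frame: pd.DataFrame) -> pd.DataFrame:
    """Apply naming, typing and de-duplication steps to a copy of the frame."""
    out = frame.copy()
    # Step 1: appropriate column names (strip whitespace, lower-case).
    out.columns = [c.strip().lower() for c in out.columns]
    # Step 2: appropriate data types (datetime for date, numeric for the rest).
    out["date"] = pd.to_datetime(out["date"])
    for col in out.columns.drop("date"):
        out[col] = pd.to_numeric(out[col])
    # Step 4: no duplications (keep the first record for each date).
    out = out.drop_duplicates(subset="date", keep="first")
    return out.sort_values("date").reset_index(drop=True)


cleaned = clean(raw)
print(cleaned.dtypes)
```

Keeping the first record per date is one possible policy; if duplicated dates carried conflicting values, they would need to be inspected rather than dropped blindly.
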
In [3]:
 
Feature types after cleaning:
 date      datetime64[ns]
open             float64
high             float64
low              float64
close            float64
volume             int64
dtype: object 
--------------------

Number of null values for unmodified data:
 open      0
high      0
low       0
close     0
volume    0
dtype: int64 
--------------------

There are a total of 244 missing business days.

The missing business days will be imputed in a forward-fill fashion.

Null values:
 open      244
high      244
low       244
close     244
volume    244
dtype: int64

All features are in an appropriate format (datetime for date, numeric for the rest), and the null values introduced by the 244 missing business days were imputed by forward fill.
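
A minimal sketch of how missing business days can be detected and forward-filled with pandas is given below. The toy `prices` frame is hypothetical, and the Monday-to-Friday calendar from `pd.bdate_range` is an assumption: it does not account for market holidays, so the notebook's actual count of 244 may have been obtained differently.

```python
import pandas as pd

# Hypothetical daily closes in which Monday 1997-03-03 is absent.
prices = pd.DataFrame(
    {"close": [285.1202, 285.1202, 278.8802]},
    index=pd.to_datetime(["1997-02-27", "1997-02-28", "1997-03-04"]),
)

# Full business-day calendar (Mon-Fri) spanning the observed date range.
calendar = pd.bdate_range(prices.index.min(), prices.index.max())
missing = calendar.difference(prices.index)
print(f"There are a total of {len(missing)} missing business days.")

# Reindexing inserts NaN rows for the gaps; ffill carries the last known values forward.
reindexed = prices.reindex(calendar)
filled = reindexed.ffill()
print(filled)
```

Forward fill assumes the last observed value still holds on the missing day, which is a common choice for prices but would be questionable for volume.
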


MODULE 2 -- EXPLORATORY DATA ANALYSIS

The first step is simply to plot the close and open values in order to zero in on any obvious trends.

The Entire Data (Stock 1)

In [4]:
 

Below are the main points observed for the data:

  • There is quite a lot of variability initially.
  • The values seem to stabilize around 2018.
  • The close and open values appear to be strongly correlated, so one could focus exclusively on close for modelling.

Data Starting At 2020 (Stock 1)

In [5]:
 

Only data from 2020 onwards will be considered for downstream analysis.

In [6]:
 

The correlation plot above confirms that open and close values are virtually perfectly correlated and only one is needed for predicting the daily value of the shares (close).
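
The subsetting and correlation check can be sketched as below. The `stock` frame is synthetic (a random walk for open, with close set to open plus small noise), so the printed coefficient illustrates the method and is not the notebook's result.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the cleaned data: close is open plus small noise.
rng = np.random.default_rng(0)
idx = pd.bdate_range("2019-06-03", periods=400)
open_values = 100 + rng.normal(0, 1, len(idx)).cumsum()
stock = pd.DataFrame(
    {"open": open_values, "close": open_values + rng.normal(0, 0.1, len(idx))},
    index=idx,
)

# Keep only records from 2020 onwards, then compute the Pearson correlation matrix.
recent = stock.loc["2020-01-01":]
corr = recent[["open", "close"]].corr()
print(corr.round(3))
```
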

Let us quickly look at the distribution of the variables in the dataset.

In [7]:
 

volume, close, and open are right-skewed. Taking their log would make them more normally distributed, which would be useful if one wished to fit a linear regression. That is not pursued here.
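
As a hedged illustration of the log transform, the sketch below uses synthetic log-normal volumes rather than the real data. It applies `np.log1p`, that is log(1 + x), instead of a plain log, on the assumption that zero volumes occur (as in the 1997-03-04 record shown earlier), for which log(0) is undefined.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic right-skewed volumes (log-normal draws) plus a single zero-volume day.
volume = pd.Series(np.append(rng.lognormal(mean=4.0, sigma=1.0, size=999), 0.0))

# log1p keeps zero volumes finite while compressing the long right tail.
log_volume = np.log1p(volume)
print(f"skew before: {volume.skew():.2f}, after: {log_volume.skew():.2f}")
```
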

Next, seasonality is assessed by applying a seasonal decomposition, plotted below.

In [8]:
 

  • The residuals are centred around 0, which suggests that the series is stationary once trend and seasonality are removed, even though some heteroscedasticity is visible (RESIDUALS plot).
  • An oscillating trend remains in the TREND plot after factoring out the seasonal component.
  • There is a clear seasonality, as illustrated by the SEASONAL plot.

The model will be optimized so that it captures as much of the trend and seasonal components as possible when making future predictions.
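
To make the three components concrete, here is a minimal sketch of a classical additive decomposition written with pandas only. It is not the notebook's code: the `additive_decompose` helper, the synthetic series and the period of 5 business days are all assumptions, and library routines such as `seasonal_decompose` in statsmodels implement the same idea more completely.

```python
import numpy as np
import pandas as pd


def additive_decompose(series: pd.Series, period: int) -> pd.DataFrame:
    """Classical additive decomposition for an odd seasonal period."""
    if period % 2 == 0:
        raise ValueError("this sketch only handles odd periods")
    # Trend: centred moving average over one full seasonal cycle.
    trend = series.rolling(window=period, center=True).mean()
    detrended = series - trend
    # Seasonal: mean detrended value at each position in the cycle, centred on zero.
    position = np.arange(len(series)) % period
    pattern = detrended.groupby(position).mean()
    pattern = pattern - pattern.mean()
    seasonal = pd.Series(pattern.to_numpy()[position], index=series.index)
    # Residual: whatever trend and seasonality do not explain.
    resid = series - trend - seasonal
    return pd.DataFrame(
        {"observed": series, "trend": trend, "seasonal": seasonal, "resid": resid}
    )


# Synthetic series: linear trend plus an exact 5-day pattern (hypothetical values).
idx = pd.bdate_range("2020-01-01", periods=100)
weekly = np.array([1.0, -0.5, 0.0, 0.5, -1.0])
t = np.arange(len(idx))
close = pd.Series(50 + 0.1 * t + weekly[t % 5], index=idx)

parts = additive_decompose(close, period=5)
print(parts.head(8))
```

Because the synthetic series contains no noise, the residuals here are essentially zero; on real prices the residual column is where heteroscedasticity would show up.
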

MODULE 3 -- TIME SERIES WITH SARIMAX

The parameters for fitting the time series model with SARIMAX are discussed in detail in Stocks_1_SARIMA.ipynb, and the associated optimisation code is provided in the Stocks_1_SARIMA.py script. Below, the performance of the optimal model will be assessed on the training data, on the validation data (walk-forward validation) and on new daily data obtained after the model was trained.
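
Walk-forward validation can be sketched as follows. This is a generic illustration, not the code from Stocks_1_SARIMA.py: the `walk_forward` helper is hypothetical, the naive last-value forecaster merely stands in for a fitted SARIMAX one-step forecast, and mean absolute error is an assumed metric.

```python
from typing import Callable, List, Sequence, Tuple


def walk_forward(
    series: Sequence[float],
    n_train: int,
    forecast_one: Callable[[Sequence[float]], float],
) -> Tuple[List[float], float]:
    """Forecast each held-out point from all earlier observations; return predictions and MAE."""
    history = list(series[:n_train])
    held_out = list(series[n_train:])
    predictions: List[float] = []
    for actual in held_out:
        # Forecast one step ahead using only data available so far.
        predictions.append(forecast_one(history))
        # Then reveal the true value before moving to the next day.
        history.append(actual)
    errors = [abs(p - a) for p, a in zip(predictions, held_out)]
    return predictions, sum(errors) / len(errors)


def naive_last_value(history: Sequence[float]) -> float:
    """Placeholder forecaster: tomorrow's close equals today's close."""
    return history[-1]


closes = [10.0, 11.0, 12.0, 11.0, 13.0, 14.0]  # hypothetical closing values
preds, mae = walk_forward(closes, n_train=3, forecast_one=naive_last_value)
print(preds, round(mae, 4))
```

The key property is that every prediction is made before its true value enters the history, which mirrors how the model would be used on new daily data.
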


AUTHOR: Vassil Dimitrov, PhD

END DATE: 2023-08-09